GVPT Maths Boot Camp
Exploratory Data Analysis
Learning objectives
- Learn how to generate questions about your data
- Learn how to discern interesting relations in your data
- Use your new data science tools to better understand your data
Two basic questions to guide your EDA
- What type of variation occurs within my variables?
- What type of covariation occurs between my variables?
Examining gapminder
library(gapminder)
library(dplyr)
head(gapminder)
# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
Variation: Factors
How many countries are there in our data set?
gapminder |>
distinct(country) |>
nrow()
How many continents?
gapminder |>
distinct(continent) |>
nrow()
EXERCISE: How many countries are there in each continent?
Variation: Numeric
What are the earliest and latest years we cover?
summarise(gapminder, min(year), max(year))
# A tibble: 1 × 2
`min(year)` `max(year)`
<int> <int>
1 1952 2007
What about our other numeric variables?
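# reframe() (dplyr >= 1.1.0) is meant for results with several rows; summarise() now expects one row per group
reframe(gapminder, across(lifeExp:gdpPercap, ~ quantile(.x)))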
# A tibble: 5 × 3
lifeExp pop gdpPercap
<dbl> <dbl> <dbl>
1 23.6 60011 241.
2 48.2 2793664 1202.
3 60.7 7023596. 3532.
4 70.8 19585222. 9325.
5 82.6 1318683096 113523.
The Five Number Summary
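The five number summary is a useful way to summarise numeric data. It consists of the:
- minimum
- first quartile (25th percentile)
- median (50th percentile)
- third quartile (75th percentile)
- maximum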
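As a rough sketch, the five values can be pulled out for a single variable with base R's `quantile()`, whose default probabilities are 0, 0.25, 0.5, 0.75 and 1. The object names `five_num` and `iqr_life` are illustrative only.

```r
library(gapminder)

# quantile() defaults to probs = c(0, 0.25, 0.5, 0.75, 1):
# minimum, first quartile, median, third quartile, maximum
five_num <- quantile(gapminder$lifeExp)
five_num

# The interquartile range (IQR) is the gap between the third and first quartiles
iqr_life <- unname(five_num["75%"] - five_num["25%"])
iqr_life
```

The first column of the quantile table above should match `five_num` once rounded.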
Visualising the Five Number Summary
library(ggplot2)
ggplot(gapminder, aes(y = lifeExp)) +
geom_boxplot() +
theme_minimal()
Visualising the IQR for groups
library(ggplot2)
ggplot(gapminder, aes(x = continent, y = lifeExp)) +
geom_boxplot() +
theme_minimal()
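The boxes in this plot span roughly the first to third quartile of each continent. A minimal sketch of the same numbers in table form, assuming the illustrative column names `q1`, `med`, `q3` and `iqr`:

```r
library(gapminder)
library(dplyr)

# One row per continent with the quartiles that define each box
iqr_by_continent <- gapminder |>
  group_by(continent) |>
  summarise(
    q1  = quantile(lifeExp, 0.25),
    med = median(lifeExp),
    q3  = quantile(lifeExp, 0.75),
    iqr = IQR(lifeExp)
  )

iqr_by_continent
```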
Visualising the distribution of numeric variables
ggplot(gapminder, aes(x = lifeExp)) +
geom_histogram() +
theme_minimal()
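Called with no arguments, `geom_histogram()` falls back to 30 bins and prints a message suggesting you pick a value yourself. A sketch with an explicit bin width; 5 years is an arbitrary choice for illustration, not a recommendation:

```r
library(gapminder)
library(ggplot2)

# Each bar now covers a 5-year band of life expectancy,
# with bin edges aligned to multiples of 5
p_hist <- ggplot(gapminder, aes(x = lifeExp)) +
  geom_histogram(binwidth = 5, boundary = 0) +
  theme_minimal()

p_hist
```

Trying a few bin widths is worthwhile, because the apparent shape of the distribution can change with the choice.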
Visualising the distribution of numeric variables
ggplot(gapminder, aes(x = lifeExp)) +
geom_density() +
theme_minimal()
Visualising the distribution of numeric variables
ggplot(gapminder, aes(x = lifeExp, fill = continent)) +
geom_density(alpha = 0.5) +
theme_minimal()
Visualising counts
gapminder |>
distinct(continent, country) |>
count(continent) |>
ggplot(aes(x = n, y = reorder(continent, n))) +
geom_col() +
theme_minimal()
Identifying unusual values
ggplot(gapminder, aes(x = gdpPercap)) +
geom_histogram() +
theme_minimal()
Identifying unusual values
ggplot(gapminder, aes(x = gdpPercap)) +
geom_boxplot() +
theme_minimal()
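To see which observations sit behind the points in the boxplot's tail, one option is to filter on the usual upper fence, Q3 + 1.5 × IQR. This is a sketch: `upper_fence` and `unusual` are illustrative names, and the boxplot's own hinges can differ slightly from `quantile()`.

```r
library(gapminder)
library(dplyr)

# Upper fence: third quartile plus 1.5 times the interquartile range
upper_fence <- unname(quantile(gapminder$gdpPercap, 0.75)) +
  1.5 * IQR(gapminder$gdpPercap)

# Country-years above the fence, largest first
unusual <- gapminder |>
  filter(gdpPercap > upper_fence) |>
  arrange(desc(gdpPercap)) |>
  select(country, year, gdpPercap)

head(unusual)
```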
Identifying relationships in your data
Does one variable tend to move in the same direction as another?
ggplot(gapminder, aes(x = log(gdpPercap), y = lifeExp)) +
geom_point() +
theme_minimal()
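A single number can complement the scatter plot. The sketch below computes the Pearson correlation between the same two variables; a value near 1 means they tend to rise together, and a value near -1 means one falls as the other rises. The name `r` is illustrative.

```r
library(gapminder)

# Pearson correlation between log GDP per capita and life expectancy
r <- cor(log(gapminder$gdpPercap), gapminder$lifeExp)
r
```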
A preview of linear regression
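The code for this slide is not shown, so the following is only one possible sketch: it overlays a straight-line fit on the previous scatter plot with `geom_smooth(method = "lm")` and fits the same line with `lm()`. The `alpha` value and the object names are assumptions.

```r
library(gapminder)
library(ggplot2)

# Scatter plot with a fitted straight line and its confidence band
p_lm <- ggplot(gapminder, aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") +
  theme_minimal()

p_lm

# The same line as a model object: intercept and slope
fit <- lm(lifeExp ~ log(gdpPercap), data = gapminder)
coef(fit)
```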
There has to be an easier way!
A quick look with glimpse():
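A minimal sketch of that call, assuming dplyr is loaded (it re-exports `glimpse()`):

```r
library(gapminder)
library(dplyr)

# glimpse() prints one line per column: its name, type, and first few values
glimpse(gapminder)
```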
A quick summary with skim():
# install.packages("skimr")  # run once if skimr is not already installed
skimr::skim(gapminder)
Summary
Today you:
- Learnt how to explore and visualise interesting relations in your data
- Used your new data science tools to better understand your data